ABSTRACT

The error-correcting algorithm described was 
constructed to examine subject headings in online catalog records for 
common errors such as omission, addition, substitution, and 
transposition errors, and to make needed changes. Essentially, the 
algorithm searches the authority file for a record whose primary key 
exactly matches the test key. If an exact match is not found, the 
algorithm identifies records in the authority file, first with the 
same initial characters, or if that is unsuccessful, with similar 
endings. The heading is then examined to see if by making simple 
changes, it can be modified to match a valid record in the authority 
file. If no match can be found, even after modification, it is then 
assumed that the heading is one of questionable validity--being
either a valid heading with no corresponding record in the authority
file or an invalid heading containing extensive errors. The algorithm
separates the subject headings into groups of valid headings, 
corrected headings, and questionable headings that require manual 
examination. Provided are one table, five figures, and 21 references. 
(Author/RBF) 
ABSTRACT



This report describes an error-correcting algorithm that examines the
subject headings in catalog records for common errors such as omission,
addition, substitution, and transposition errors. If such errors are
identified, the algorithm makes the needed corrections. The algorithm
requires a subject heading authority file.

The subject heading authority file contains records representing valid
subject headings. Each authority file record contains the subject heading,
its primary key, and its reverse key. The primary key is derived from the
subject heading by taking the initial letters or digits from the heading. The
reverse key is formed by taking the last letters or digits, in reverse order,
from the subject heading.

The error-correcting algorithm starts with a test subject heading whose
validity is to be established. The subject heading under consideration will
belong to one of the following classes: (1) valid subject heading which is
included in the authority file; (2) valid subject heading which is not
included in the authority file; and (3) invalid subject heading. The
error-correcting algorithm derives the primary key of the test subject heading
and searches the authority file for a record whose primary key matches exactly
that of the test heading. If an exact match is found in the authority file,
the test heading is assumed to be correct. If an exact match is not found,
the algorithm identifies records from the authority file whose primary keys
have the same initial characters as that of the test subject heading. The
heading is then examined to see if, by making simple changes, it can be
modified to match one of the valid records in the authority file. If
modification does not produce a match, it is assumed that the error lies in
the initial set of characters of the heading. Using the reverse key, the
algorithm compares the heading to authority file records with similar endings.
If no match can be found, even after modification, it is then assumed that the
heading is one of questionable validity, being either a valid heading with
no corresponding record in the authority file or an invalid heading containing
extensive errors. The algorithm separates the subject headings into groups of
valid headings, corrected headings, and questionable headings that require
manual examination.
I. INTRODUCTION



The presence of errors in the on-line union catalogs of bibliographic
utilities such as OCLC has an adverse effect on the utilities themselves and
on the end users of their data bases. Bourne's analysis of the impact of
spelling errors, although he was writing from the context of commercial
bibliographic search systems such as SDC and BRS, is still valid for on-line
union catalogs. [1] For the bibliographic utilities, the negative
consequences are, following Bourne: "(1) extra computer time, storage space
[illegible] costs...; (2) damage to image/credibility/marketability [of the
bibliographic utilities]; [and] (3) less effective service than is otherwise
possible." The end users of the catalog records are forced to divert some of
their resources in terms of personnel, time, and communication costs, to
"cleaning up" the records before they can be used.

For OCLC users, errors in the subject heading fields currently are only a
minor nuisance that can be overcome either by editing the record before
producing cards or by simply ignoring the errors when the subject cards are
filed. However, in the near future when computerized systems play a major
role in providing subject access, these errors will have to be taken into
account. Computers are not as forgiving of errors as are humans. With
computerized subject access, errors in the subject heading fields
frequently result in the records being inaccessible and, depending on the
retrieval technique employed, may make the search much more difficult. In
view of these problems, bibliographic utilities have to devise effective and
efficient means of identifying and correcting common errors in the subject
headings.



A. Scope of Study

The algorithm described here is a by-product of a research project on the
Library of Congress subject headings reported by O'Neill and Aluri. [2] The
original project was undertaken to examine the distribution patterns of LC
subject headings in the OCLC catalog records and to study the information
content of subject headings assigned. During the course of this project,
however, the presence of numerous misspellings and inconsistent spacing,
punctuation, and capitalization practices could not be overlooked. The
recognition of the need for correcting such common errors led to the design of
the proposed error-correcting algorithm.

The original project on LC subject headings was conducted on a sample of
33,455 catalog records in the OCLC data base. The sample contained every
full-level, nonjuvenile monographic record in the OCLC data base whose OCLC
control number ended with "96," as of September 2, 1978. Of the 33,455 records
in the sample, 7,490 were received from the Library of Congress through its
MARC Distribution Service and the remaining 25,965 were cataloged by OCLC
member libraries. A total of 50,213 subject headings occurred in the sample,
of which 47,036 were Library of Congress subject headings.
B. Subject Heading Errors

An examination of subject headings extracted from the 33,455 monographic
records in the OCLC data base showed that a significant number of the headings
contained various types of errors. The majority of the errors were
typographical and fell into one of four major categories: [3,4]

(1) omission errors,

(2) addition errors,

(3) substitution errors,

(4) transposition errors.

"Antomy, Human" (instead of "Anatomy, Human") is an example of an omission
error where a character was inadvertently dropped. "Geographty" (instead of
"Geography") is an example of an addition error where an extraneous character
was added. Substitution errors, as in "Hard-co;rd unemployed" (instead of
"Hard-core unemployed"), have one character replaced by an incorrect
character. "Commerical law" (instead of "Commercial law") illustrates a
transposition error where a pair of characters is transposed. Although some
subject headings contained multiple errors involving multiple characters, most
of the incorrect subject headings contained only a single error involving a
single character or one pair of characters.

The sample of subject headings also contained several spacing,
punctuation, and capitalization inconsistencies. Examples are:

U.S.              U. S.             (spacing)
Postage-stamps    Postage stamps    (punctuation)
Congresses        congresses        (capitalization)

While these inconsistencies are relatively unimportant in manual card
catalogs, they become significant in a computerized catalog. For instance,
computer software treats "O.T." and "O. T." as different character strings and
hence as different headings.

Finally, the subject headings sample contained a large number of
inconsistent abbreviations. Although, strictly speaking, abbreviations are
not errors, the absence of standardization in their use may cause retrieval
problems. For example, the subdivision "Description and travel" appeared in
the sample in the following forms:

Descr. and trav.
Description and travel
Description & travel
Descr. & trav.
Descr. & travel
Desc. & trav.
Desc. & travel
Desr. & trav.

Each of these strings is distinct to the computer.
Typographical errors, variations in punctuation, and variations in the use
of abbreviations become serious problems as the size of the data base
increases. It is conservatively estimated that 1% of OCLC records contain
errors in their subject heading fields. Assuming the number of errors
increases linearly with the size of the data base, one can estimate that when
the OCLC data base grows to 10 million records, it could contain 100,000
catalog records with errors in subject headings alone. Many of these records
may be inaccessible to the user through normal retrieval mechanisms.
Consequently, these typographical and variation errors must be identified and
corrected.



C. Objective of the Report

This report addresses the presence of errors and variations, and the need
to correct and standardize subject headings for information retrieval. An
error-correcting algorithm is described that automatically corrects a large
percentage of the typographical errors. In addition, the algorithm identifies
subject headings that may contain errors, regardless of their cause.

The proposed error-correcting algorithm is intended to be conservative.
That is, it is designed to correct relatively simple errors and to identify
complex errors for scrutiny by human editors. At the same time, the algorithm
produces a list of corrected subject headings to permit the editors to check
that the algorithm is not altering valid headings.
II. ERROR-CORRECTING ALGORITHM



A fairly large body of literature exists on the detection and automatic
correction of spelling errors. [5-14] According to Zamora, the techniques
described in the literature fall into three categories: "(1) isolation of low
frequency words, (2) dictionary look-up, and (3) n-gram analysis, where an
n-gram is a string of n characters extracted from a word." [15] The
dictionary look-up method is the most appropriate for correcting spelling
errors in subject headings in the records of the on-line union catalogs of
bibliographic utilities. Here, the dictionary that could be used for subject
heading verification and correction is a list of authorized subject headings.
Because the on-line union catalogs of the bibliographic utilities are created
and maintained through the cooperative efforts of a large number of libraries,
there is already a need for the development of various authority lists (such
as subject heading and name authority lists). Once such authorized lists are
developed, they should logically be used in detecting and correcting errors.
In this connection, Zamora's observation that "the dictionary look-up
technique has the most favorable ratio of misspellings to words flagged" when
applied to the CAS (i.e., Chemical Abstracts Service) data base is
encouraging. [16]

Morgan describes the two stages of the error-correcting algorithm of the
dictionary look-up technique. [17] In the case of subject headings, Morgan's
algorithm begins with two key elements: (1) a test subject heading whose
validity is under question, and (2) a valid subject heading that belongs to
the authorized list of subject headings. The two basic stages of the dictionary
look-up technique, as described by Morgan, are:

(1) selecting a subset of valid subject headings from the authority list,
where the subset of subject headings contains nearly all headings of
which the test heading may be a misspelling; and

(2) comparing, pairwise, each of the valid subject headings with the test
subject heading to determine whether or not the test heading is a
misspelling of the authorized heading.

Following Morgan, the error-correcting algorithm described in this report
contains three elements:

(1) creation of a subject heading key corresponding to each of the valid
and test subject headings, where the subject heading keys, rather
than the subject headings themselves, are used in pairwise comparison
between valid and test subject headings;

(2) creation of a subject heading authority file that would be the source
of valid subject headings; and

(3) an error-correction routine that selects the subset of valid subject
headings against which the test headings are compared, compares the
test subject headings with the valid subject headings for possible
errors, and then corrects the test headings, if errors are detected.



The design of the error-correcting algorithm is based on the following
observations:

(1) Over 90% of the subject headings in the OCLC records are Library
of Congress (LC) subject headings. [18] LC subject headings are a
controlled vocabulary and form the authority list from which LC
and most OCLC member libraries draw the headings for assignment
to catalog records. In other words, the predominant use of LC
subject headings limits both the number of subject headings and
their variations which can occur in the OCLC records.

(2) If the subject headings are arranged according to their
frequencies of occurrence, the headings which occur least
frequently contain the largest number of typographical errors.
[19]

(3) The typographical errors fall into one of the four categories
identified in Chapter I of this report: dropped characters,
excess characters, characters substituted for others, and pairs
of transposed characters.

(4) In a majority of cases, subject headings contain only one error
involving only one character or one pair of characters.

The first two observations are used to create an authority file consisting of
"good" subject headings. The last two observations are used to compare
potentially erroneous headings with those in the authority file and to correct
the errors.



A. Subject Heading Key 

Much of the manipulation of subject headings is done on a key constructed
from the subject headings. The subject heading key, which contains 28
characters, is made up of one character identifying the type of subject
heading followed by 27 characters derived from the heading. Topical subject
headings are assigned the identifying character "1," geographic headings the
character "2," etc., as shown in Table 1. The derived portion of the key
contains the first 27 characters of each subject heading.




Table 1. Types of Subject Headings and Subdivisions
         Identified in the Keys

First Character    Type of Subject
of a Key           Heading or Subdivision

1                  Topical Subject Heading
2                  Geographic Subject Heading
3                  Personal Name Subject Heading
4                  Corporate Name Subject Heading
5                  Conference/Meeting Subject Heading
6                  Uniform Heading Subject Heading
X                  General Subdivision
Y                  Period Subdivision
Z                  Place Subdivision


All letters are capitalized to eliminate the differences caused by
variations in capitalization. If the subject heading contains numeric
characters, all occurrences of the digit "1" are converted to the alphabetic
character "L." This is done to compensate for the common confusion between the
digit "1" and the lowercase letter "l." The subject heading key ignores
special characters, punctuation, spacing, and capitalization to compensate for
minor variations in the subject headings. The key thus constructed eliminates
a large number of common variations in the subject headings. Therefore, the
key can be used to group together many of the variants of a subject heading.
Examples of variant forms of subject headings and their keys are:

Greco-Turkish War, 1921-1922.      }
Greco- turkish war, 1921-1922:     }  1GRECOTURKISHWARL92LL922
Greco- turkish War, 1921-1922.     }
Greco-turkish War, 1921 - 1922.    }

Freedman in Beaufort co., S.C.     }
Freedman- in Beaufort Co., S.C.    }  1FREEDMANINBEAUFORTCOSC
Freedman in Beaufort co., S. C.    }

The digit "1" in the first-character position in the preceding keys
identifies the keys as topical subject headings. Blanks are used at the end
to fill the key out to 28 characters. In the case of subject headings
containing subdivisions, separate keys for main headings and subdivisions are
maintained. For example, if the subject heading is:

"650 Wf English fiction $y 19th century $x History and criticism"

the following three keys are derived:

1ENGLISHFICTION
YL9THCENTURY
XHISTORYANDCRITICISM

As with the main headings, the first character in the subdivision key
identifies the type of subdivision. However, for subdivisions an
alphabetical character is used. In the above example, the Y and X
preceding the second and third keys identify period and general subdivisions,
respectively. Table 1 shows the identification characters used in the keys
and the corresponding types of subject headings or subdivisions.
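
The key derivation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `primary_key` and the use of a regular expression are not from the report, and the type character is passed in by the caller rather than read from a MARC tag.

```python
import re
from typing import Tuple

KEY_LENGTH = 28  # one type character plus 27 derived characters


def primary_key(heading: str, type_char: str = "1") -> Tuple[str, int]:
    """Return (28-character padded key, nonblank length before truncation)."""
    # Capitalize, then drop spacing, punctuation, and special characters.
    derived = re.sub(r"[^A-Z0-9]", "", heading.upper())
    # Convert the digit 1 to the letter L, as the report describes.
    derived = derived.replace("1", "L")
    full = type_char + derived
    # Truncate to 28 characters and pad short keys with blanks.
    return full[:KEY_LENGTH].ljust(KEY_LENGTH), len(full)


key, length = primary_key("Greco-Turkish War, 1921-1922.")
print(key.rstrip(), length)
```

Returning the untruncated length alongside the key mirrors the report's practice of counting dropped characters in the key length.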

A reverse key also is derived from the subject heading for use by the
error-correcting routine. Reverse keys are made up of the last 14 nonblank
characters, in reverse order, from the primary 28-character subject heading
key. For example, if a subject heading is "Self-instruction," its primary key
would be "1SELFINSTRUCTION" and its reverse key would be "NOITCURTSNIFLE."
The primary subject heading key is used to locate subject headings in the
authority file and to check for errors in the second half-segment of the
heading; the reverse key is used to check for errors in the first half-segment
of the heading.

For very long headings, some characters will be dropped in forming the
keys. For example, the subject heading "Information storage and retrieval
systems" would have as its primary key "1INFORMATIONSTORAGEANDRETRIE" since
only the first 27 valid characters from the heading could be used. The length
of this key would be 38 since the dropped characters would be counted. The
reverse key would be formed from the last 14 valid characters from the
heading. For this example, the reverse key would then be "SMETSYSLAVEIRT."
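
A minimal sketch of the reverse key follows, assuming the same normalization as the primary key (capitalization, removal of non-alphanumeric characters, and conversion of the digit 1 to L). The function names are hypothetical. For long headings the sketch takes the last 14 characters of the untruncated key, which matches the second example above.

```python
import re


def normalize(heading: str) -> str:
    """Capitalize, keep letters and digits only, and map the digit 1 to L."""
    return re.sub(r"[^A-Z0-9]", "", heading.upper()).replace("1", "L")


def reverse_key(heading: str, type_char: str = "1", length: int = 14) -> str:
    """Last `length` characters of the untruncated key, in reverse order."""
    full = type_char + normalize(heading)
    return full[-length:][::-1]


print(reverse_key("Self-instruction"))
print(reverse_key("Information storage and retrieval systems"))
```

One consequence of this reading is that for keys shorter than 14 characters the type character ends up as the last character of the reverse key.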



B. Authority File 

The authority file is created based on the assumption that frequently
occurring subject headings will be valid. It is further assumed that the
catalog records issued by the LC MARC Distribution Service (LC records) are
less likely to have typographical errors than those contributed by the OCLC
member libraries (contributed records). Once these assumptions are accepted,
"good" subject headings can be defined as those whose frequency of occurrence
in LC and contributed records equals or is greater than an arbitrary number,
while, at the same time, occurring at least a set number of times in LC
records. All subject headings which satisfy these requirements are placed in
the authority file. The more stringent the requirements for inclusion of
subject headings in the authority file and the larger the number of subject
headings on which the authority file is based, the more confident one can be
that these subject headings are free from typographical errors. However, it
is not always necessary to create an authority file by this method. For
instance, an organization, such as LC, might distribute a machine-readable
authority file of subject headings. Such a file, once carefully checked for
errors, would also be suitable for this algorithm.
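
The frequency-based selection can be illustrated as follows. The report does not state the thresholds, so `min_total` and `min_lc` below are purely hypothetical values, and the function name is an assumption made for this sketch.

```python
from collections import Counter


def select_good_headings(lc_keys, contributed_keys, min_total=5, min_lc=2):
    """Keep keys that occur often enough overall and in LC records."""
    lc = Counter(lc_keys)
    total = lc + Counter(contributed_keys)
    return sorted(
        key for key, count in total.items()
        if count >= min_total and lc[key] >= min_lc
    )


lc_records = ["1SODIUM", "1SODIUM", "1SOFTBALL"]
contributed = ["1SODIUM"] * 3 + ["1SOFTBALL"] * 4 + ["1SODUIM"] * 5
print(select_good_headings(lc_records, contributed))
```

In this toy input the misspelled key is frequent in contributed records but never appears in LC records, so the second requirement excludes it.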

Each record in the authority file consists of information on a main
subject heading or subdivision. Figure 1 shows an authority record for the
subject heading, "Copyright, International." The first 28 characters in the
authority record represent the primary key for the subject heading. The next
14 characters in the record represent the reverse key. Following the reverse
key are three fields which show: (1) the number of nonblank characters in the
primary key including any dropped characters, (2) the length of the subject
heading or subdivision, and (3) a type of record indicator showing whether the
record represents an entry for a valid heading or an entry for an abbreviated
heading. When the type of record indicator is "[illegible]," it indicates that
the record represents a valid heading. The last field in the authority record
contains the actual subject heading or subdivision.

When the type of record indicator is "1," it indicates that the record
represents an entry for an abbreviation. Entries for abbreviations act as
"see" references. Figure 2 shows the authority record for an abbreviation.
The first 28 characters represent the primary key for the abbreviated heading,
e.g., "Hist. & Crit.," which is a general subdivision. The primary key is
followed by three fields as in the case of a valid heading. The type of
record indicator, however, is "1," showing that the entry is for an
abbreviated heading. The type of record indicator in this case is followed by
the primary key for the unabbreviated version of the heading.
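
One way to picture the two kinds of authority record is the sketch below. The field names, the use of unpadded keys, and the indicator value `"0"` for a valid heading are assumptions made for illustration; only the value `"1"` for an abbreviation is given in the report.

```python
from dataclasses import dataclass
from typing import Dict, Optional

VALID = "0"         # assumed value; not preserved in the source text
ABBREVIATION = "1"  # value stated in the report


@dataclass
class AuthorityRecord:
    primary_key: str
    reverse_key: str
    key_length: int      # nonblank key characters, counting dropped ones
    heading_length: int
    record_type: str
    text: str            # heading, or the unabbreviated key for a "see" entry


def resolve(key: str, records: Dict[str, AuthorityRecord]) -> Optional[AuthorityRecord]:
    """Look up a key, following an abbreviation entry to its full heading."""
    record = records.get(key)
    if record is not None and record.record_type == ABBREVIATION:
        return records.get(record.text)
    return record


full = "History and criticism"
records = {
    "XHISTORYANDCRITICISM": AuthorityRecord(
        "XHISTORYANDCRITICISM", "MSICITIRCDNAYR", 20, len(full), VALID, full),
    "XHISTCRIT": AuthorityRecord(
        "XHISTCRIT", "", 9, len("Hist. & Crit."), ABBREVIATION,
        "XHISTORYANDCRITICISM"),
}
print(resolve("XHISTCRIT", records).text)
```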
Figure 1. Authority File Record
[Figure: record layout with the following labeled fields: 28-character
primary key, whose first character indicates the type of subject heading or
subdivision; 14-character reverse key; number of nonblank characters in the
primary key; number of characters in the subject heading or subdivision;
abbreviation indicator; main subject heading or subdivision.]

Figure 2. Authority Record for an Abbreviated Subject Heading or Subdivision
[Figure: record layout with the following labeled fields: 28-character key
for an abbreviated subject heading or subdivision, whose first character
indicates the type of subject heading or subdivision; number of nonblank
characters in the key for the abbreviated heading or subdivision; number of
characters in the primary key for the unabbreviated version; type of record
indicator; primary key for the unabbreviated version of the subject heading
or subdivision.]
The authority records for abbreviations, in contrast to those for valid
subject headings and subdivisions, have to be manually introduced into the
file. This, however, should not present a serious problem as it is not
difficult to compile a list of commonly occurring abbreviations in cataloging
records. When the error-correcting algorithm comes across a key for an
abbreviated heading, that key can be automatically replaced by the key for the
unabbreviated version of the heading.

The authority file is an indexed file sequenced on its primary key
(Figure 3). The index sequential organization brings together subject
headings and subdivisions starting with the same initial characters. Records
in the authority file can be accessed directly using the complete primary key.
To determine if a given heading is in the authority file, the key for the
heading would be derived. If a record with exactly the same key exists in the
authority file, the corresponding record would be retrieved. When no match is
found, it would be known that either the heading is new or that the heading is
invalid.

The authority file can also be accessed directly using only the initial
portion of the primary key. If the authority file shown in Figure 3 was read
using the key "1SODIU," no exact match would be found but the file would be
positioned so that subsequent sequential read operations would retrieve the
records corresponding to the headings "Sodium," "Sodium sulphate," "Softball,"
etc.
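
The positioning behavior of a partial-key read can be imitated with a sorted list and a binary search. This is a sketch only: a Python list stands in for the indexed file, the function name is hypothetical, and the keys other than the three named in the text are invented for the example.

```python
import bisect


def read_from(sorted_keys, partial_key, count=3):
    """Position at the first key >= partial_key, then read sequentially."""
    start = bisect.bisect_left(sorted_keys, partial_key)
    return sorted_keys[start:start + count]


authority_keys = sorted([
    "1SOCIOLOGY",        # hypothetical neighbor
    "1SODIUM",
    "1SODIUMSULPHATE",
    "1SOFTBALL",
    "1SOILS",            # hypothetical neighbor
])
print(read_from(authority_keys, "1SODIU"))
```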

If only the end of a heading is known, the primary key index is of no
assistance. To access the authority file by the reverse key, a second
nonsequential index to the authority file is required (Figure 4). This
nonsequential index permits access by the reverse key or by any portion of the
reverse key. If the authority file shown in Figure 4 is read using the key
"SCISYHPD," no exact match would be found but the file would be positioned.
Subsequent sequential reads through the nonsequential index would retrieve the
authority records for "Cloud physics," "Scattering (Physics)," "Medical
physics," etc. When it is assumed that there is an error in the first half of
the heading, the nonsequential reverse key index must be used to access the
authority file.
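
Access by ending can be sketched the same way, with a second sorted list of (reverse key, heading) pairs standing in for the reverse key index. The helper names are hypothetical, and the heading "Sodium" is added only to show that unrelated endings are skipped.

```python
import bisect
import re


def reverse_key(heading, type_char="1", length=14):
    """Last 14 characters of the normalized key, in reverse order."""
    full = type_char + re.sub(r"[^A-Z0-9]", "", heading.upper()).replace("1", "L")
    return full[-length:][::-1]


headings = ["Cloud physics", "Medical physics", "Scattering (Physics)", "Sodium"]
reverse_index = sorted((reverse_key(h), h) for h in headings)


def read_by_ending(index, partial_reverse_key, count=3):
    """Position at the first reverse key >= the partial key, then read on."""
    start = bisect.bisect_left(index, (partial_reverse_key,))
    return [heading for _, heading in index[start:start + count]]


print(read_by_ending(reverse_index, "SCISYHPD"))
```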
Figure 3. Sequential Index Organized by Primary Key

Figure 4. Nonsequential Index Organized by Reverse Key



C. Error-correcting Procedure

The error-correcting procedure is performed on subject headings in the
unchecked subject headings file. This unchecked file includes all those
subject headings that must be tested for accuracy. The procedure consists of
two operations: (1) check to see if an exact match is found, and (2) if not,
make corrections to the headings, if possible.

Using the error-correcting procedure, the subject headings under
consideration are compared with those in the authority file to detect and
correct errors. If there is an exact match between a subject heading in the
unchecked file and one in the authority file, the heading in the unchecked
file is accepted as valid. If no match is found, the heading in the unchecked
file is examined for errors. If the algorithm fails to find a match between
the heading in the unchecked file and those headings in the authority file
even after corrections are made for possible typographical errors, the
unchecked file heading is transferred to a "questionable" subject headings
file for manual review. Figure 5 presents a diagrammatic representation of
this procedure.

The subject headings in the unchecked file are matched with those in the
authority file through two operations. In the first operation, characters in
the first half of the test heading are assumed to be correct and the remaining
characters are examined for errors. In the second operation, the process is
reversed; the characters in the second half of the heading are assumed to be
correct and the first half of the heading is checked for errors.
Figure 5. Error-correcting Algorithm

1. Check for Errors in the Second Half of the Subject Heading 

When checking for errors in the second half of a heading, a certain number
of characters counted from left to right, depending upon the length of the key
of the test subject heading, are assumed to be free from typographical errors.
These characters are referred to as the "truncated key." The truncated key is
used as an entry point into the authority file to locate the relevant subject
headings. Then, the keys of these potentially relevant subject headings are
matched with those of the test subject heading to identify typographical
errors.

The truncated key is created from the original key of the test subject
heading using the formula:

Length of the truncated key = (Key length - 1) / 2

This formula is [illegible] the key contains 6 or fewer characters. If the
key contains an even number of characters, the truncated key length is obtained
by rounding down the value obtained from the preceding formula. That is, if the
length of the key is 10 characters, the formula gives the length of the
truncated key as (10 - 1) / 2 = 4.5, which is then rounded down to 4. This
means that the truncated key of the test subject heading contains 4 characters
that are assumed to have no typographical errors.

To illustrate this truncation further, consider the 11-character key
"1INVENITONS" of a test subject heading. The truncated key consists of 5
characters, "1INVE." These 5 characters, assumed free from typographical
errors, are used to identify the potentially relevant subject headings in the
authority file. By comparing the relevant headings identified by the
truncated key with the test subject heading, the error-correcting algorithm
checks for any spelling error in the last 6 characters, "NITONS," of the test
subject heading.
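
The truncation rule reduces to an integer division, as the following sketch shows. The function names are hypothetical, and the sketch does not model the special handling of keys with 6 or fewer characters, since that rule is not legible in the source.

```python
def truncated_key_length(key_length: int) -> int:
    """(key length - 1) / 2, rounded down."""
    return (key_length - 1) // 2


def truncated_key(key: str) -> str:
    """Leading characters of the key that are assumed to be error-free."""
    key = key.rstrip()  # ignore blank padding
    return key[:truncated_key_length(len(key))]


print(truncated_key_length(10))
print(truncated_key("1INVENITONS"))
```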

Not all subject heading keys whose initial characters agree with the
truncated key of the test subject heading are checked for errors. The only
subject heading keys checked for errors are: (1) those whose initial
characters are the same as those of the truncated key; and (2) those whose
total lengths are within one character of the length of the key of the test
subject heading. The length of the subject heading keys to be checked is
restricted because the error-correcting algorithm attempts to correct only the
following types of errors:

(1) one excess character,

(2) one dropped character,

(3) a character incorrectly substituted,

(4) a transposition error.

In the case of an excess or dropped character, the keys to be
compared have either one character more or one less than the key of
the test subject heading. In the case of the substitution or transposition
error types, the length of the key of the test subject heading is equal to that
of the correct subject heading. If the lengths are equal, the keys are
compared to determine the number of characters which do not match. If only
one character does not match, then it is assumed to be a character incorrectly
substituted. If two adjacent characters do not match, the test key is a
candidate for a transposition error. [20] When more than two characters do not
match, no automatic attempt is made to correct the heading.
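
The pairwise comparison can be sketched as below. The function name and return labels are hypothetical. Two points go beyond what the report states and are assumptions of this sketch: the two adjacent mismatched characters are required to be an exact swap before the difference is called a transposition, and excess or dropped characters are detected by deleting one character at a time.

```python
def classify(test_key: str, valid_key: str):
    """Name the single error that turns valid_key into test_key, if any."""
    if test_key == valid_key:
        return "match"
    if len(test_key) == len(valid_key):
        diffs = [i for i, (a, b) in enumerate(zip(test_key, valid_key)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and test_key[diffs[0]] == valid_key[diffs[1]]
                and test_key[diffs[1]] == valid_key[diffs[0]]):
            return "transposition"
        return None
    if len(test_key) == len(valid_key) + 1:
        # The test key has one excess character.
        if any(test_key[:i] + test_key[i + 1:] == valid_key
               for i in range(len(test_key))):
            return "addition"
    if len(valid_key) == len(test_key) + 1:
        # The test key has one dropped character.
        if any(valid_key[:i] + valid_key[i + 1:] == test_key
               for i in range(len(valid_key))):
            return "omission"
    return None


print(classify("1INVENITONS", "1INVENTIONS"))
```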

As an illustration, let us again consider the test key, "1INVENITONS." Its
truncated key is "1INVE." This truncated key identifies the following keys in
the authority file as potentially relevant:

KEY                LENGTH

1INVENTIONS*         11
1INVENTORIES*        12
1INVENTORS*          10
1INVERSE              8
1INVERTEBRATES       14
1INVESTIGATION       14
1INVESTIGATIONS      15
1INVESTMENT*         11
1INVESTMENTS*        12


Since the length of the key of the test subject heading is 11 characters, only
those keys whose lengths are 10, 11, or 12 characters are targeted for
comparison with the test key. The target keys in the preceding list are
identified by asterisks (*). The error-correcting algorithm compares
"1INVENTIONS" and "1INVESTMENT" (each 11 characters) with the test key
"1INVENITONS" for possible replacement and transposition errors. Similarly,
the algorithm compares "1INVENTORIES" and "1INVESTMENTS" (each 12 characters)
with "1INVENITONS" for a dropped character in the test key. The algorithm
then compares "1INVENTORS" (10 characters) with "1INVENITONS" for an added
character in the test key.
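
The selection of target keys in this example can be reproduced with a short filter. The function name is hypothetical, and a plain list stands in for the sequential read of the authority file.

```python
def target_keys(test_key, authority_keys):
    """Keys sharing the truncated key and within one character in length."""
    n = len(test_key)
    prefix = test_key[:(n - 1) // 2]
    return [key for key in authority_keys
            if key.startswith(prefix) and abs(len(key) - n) <= 1]


candidates = [
    "1INVENTIONS", "1INVENTORIES", "1INVENTORS", "1INVERSE",
    "1INVERTEBRATES", "1INVESTIGATION", "1INVESTIGATIONS",
    "1INVESTMENT", "1INVESTMENTS",
]
print(target_keys("1INVENITONS", candidates))
```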

When the error-correcting algorithm discovers that a test subject heading 
differs from that of an authority file heading only in numeric characters, the 
algorithm will not alter the test heading. The reason is that the algorithm 
takes advantage of the natural redundancy in the subject headings. However, 
this redundancy does not exist in the case of numeric characters. For 
instance, changing the following subject headings from one to the other would 
result in incorrect headings: 

IBM 360 (computer)                    IBM 370 (computer) 

Piano music (3 hands)                 Piano music (4 hands) 

United States-Economic Policy-1961    United States-Economic Policy-1971 

In any case, given the widespread occurrence of such headings which differ 
only slightly in numeric characters and their unpredictability, the 
error-correcting algorithm does not change the numeric characters. 
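
One way to express this guard is to compare the two headings with all digits removed: if the remainders agree but the headings do not, the difference lies only in numeric characters and no correction should be made. The sketch below is an assumption about how such a check could be written; the name `differs_only_in_digits` is hypothetical and does not come from the report.

```python
def differs_only_in_digits(test_heading: str, authority_heading: str) -> bool:
    """True when the headings differ, but only in their numeric characters."""
    if test_heading == authority_heading:
        return False

    def strip_digits(text: str) -> str:
        # Remove every digit so that only the non-numeric part is compared.
        return "".join(ch for ch in text if not ch.isdigit())

    return strip_digits(test_heading) == strip_digits(authority_heading)


def may_correct(test_heading: str, authority_heading: str) -> bool:
    """A correction is allowed only if the difference is not purely numeric."""
    return not differs_only_in_digits(test_heading, authority_heading)
```

Under this sketch, "IBM 360 (computer)" is never rewritten to "IBM 370 (computer)", whereas a misspelling in the alphabetic part of a heading remains eligible for correction.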

There may be some headings for which the primary key logically should have 
more than 28 characters. The number of nonblank characters in the key before 
truncation is included as part of the authority record. Whenever this value 
exceeds 28, the primary key will be incomplete. Since it will always have the 
initial characters and the character count, the incomplete key does not pose 
any problems in identifying the target keys. However, the incomplete key 
cannot be compared to the test key. For this purpose, the full key must be 
rederived from the subject heading or subdivision contained in the authority 
record. 

Thus, the error-correcting algorithm compares all the subject headings in 
the test file with those in the authority file and splits the test file into 
two separate files: (1) valid headings, and (2) questionable headings. Valid 
headings are those for which there is a corresponding heading in the authority 
file and hence which are assumed to be free from errors, or headings for 
which, after correcting a typographical error, the algorithm found a match in 
the authority file. The questionable headings file consists of those headings 
for which no match could be found in the authority file even after correcting 
potential typographical errors. The headings in this file are then checked 
using the reverse key. 

2. Check for Errors in the First Half of the Subject Heading 

The algorithm described in the previous section would not work if the 
error occurs in the initial characters of the subject headings. If the key of 
the test subject heading is "1INEVNTIONS" instead of "1INVENITONS," the 
program would have difficulty in identifying the potentially relevant keys in 
the authority file. For this purpose, the reverse keys for all headings in 
the authority file are utilized. For example, for the keys "1INVENTIONS" and 
"1INVESTIGATIONS," corresponding reverse keys are "SNOITNEVNI1" and 
"SNOITAGITSEVNI1." These reverse keys form the basis for correcting errors 
which occur in the first half of the test subject heading. 

If the key of a test subject heading is "1INEVNTIONS," the algorithm 
reverses the key to "SNOITNVENI1." The remaining correction procedure for 
deriving a truncated key, identifying the potentially relevant keys in the 
authority file, identifying the target keys, and performing the final 
correction process is the same as that described in the previous section. The 
truncated reverse key is "SNOIT" which identifies the following entries as 
potentially relevant: 



PRIMARY KEY                  REVERSE KEY        LENGTH 

1DISLOCATIONS                SNOITACOLSID1        13 
1INVESTIGATIONS              SNOITAGITSEVNI       15 
1DIFFUSIONOFINNOVATIONS      SNOITAVONNIFON       23 
1MEDICALINNOVATIONS          SNOITAVONNILAC       19 
1EDUCATIONALINNOVATIONS      SNOITAVONNILAN       23 
1AGRICULTURALINNOVATIONS     SNOITAVONNILAR       24 
1INJECTIONS*                 SNOITCEJNI1          11 
1FUNCTIONS*                  SNOITCNUF1           10 
1HYPERFUNCTIONS              SNOITCNUFREPYH       15 
1INJUNCTIONS*                SNOITCNUJNI1         12 
1INVENTIONS*                 SNOITNEVNI1          11 
XINVENTIONS*                 SNOITNEVNIX          11 

Since the length of the test subject heading key is 11 characters, only those 
keys whose lengths are 10, 11, or 12 characters need to be compared with the 
test key. Once the target keys (marked with asterisks in the preceding list) 
are identified, the procedure proceeds exactly the same way as when the target 
keys were identified using the primary keys. After changing the "EV" in the 
test key to "VE," a match would be found and the correct form of the heading 
would be assumed to be "Inventions." 
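
The reverse-key pass can be illustrated with a short Python sketch. It is not the program used in the study: the names `reverse_key`, `is_adjacent_transposition`, and `correct_via_reverse_key` are hypothetical, the five-character truncation is assumed from the "SNOIT" example, and only the transposition case is handled to keep the example brief.

```python
from typing import Optional


def reverse_key(key: str) -> str:
    """Form the reverse key by reading the key's characters last to first."""
    return key[::-1]


def is_adjacent_transposition(a: str, b: str) -> bool:
    """True when a and b are equal except for one swapped pair of neighbours."""
    if len(a) != len(b):
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and a[diffs[0]] == b[diffs[1]] and a[diffs[1]] == b[diffs[0]])


def correct_via_reverse_key(test_key: str, authority_keys: list[str],
                            prefix_len: int = 5) -> Optional[str]:
    """Look for a correction when the error sits near the start of the key."""
    reversed_test = reverse_key(test_key)
    truncated = reversed_test[:prefix_len]  # e.g. "SNOIT"
    for key in authority_keys:
        candidate = reverse_key(key)
        if not candidate.startswith(truncated):
            continue  # not potentially relevant
        if abs(len(key) - len(test_key)) > 1:
            continue  # not a target key
        if is_adjacent_transposition(reversed_test, candidate):
            return key
    return None


authority_keys = ["1DISLOCATIONS", "1INVESTIGATIONS", "1INJECTIONS",
                  "1FUNCTIONS", "1INJUNCTIONS", "1INVENTIONS"]
corrected = correct_via_reverse_key("1INEVNTIONS", authority_keys)
```

Here the reversed test key "SNOITNVENI1" differs from "SNOITNEVNI1" only in one swapped pair, so `corrected` is "1INVENTIONS". In practice the reverse keys would be read from the authority record instead of being recomputed on each lookup.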



D. Testing the Algorithm 

To test the error-correcting algorithm, a COBOL program was developed for 
the Sigma 9 computer at OCLC. The test was limited to form subdivisions since 
a relatively complete list of form subdivisions had been compiled as a part of 
the study on subject heading patterns. [21] The list was available in 
machine-readable form and was used as an authority file for form subdivisions. 
Form subdivisions extracted from bibliographic records were then checked using 
the error-correcting algorithm. The final version of the algorithm described 
above successfully corrected all omission, addition, substitution, and 
transposition errors. Subdivisions containing abbreviations were also 
successfully changed when a record for the abbreviated subdivision was 
included in the authority file. Form subdivisions containing more serious 
errors or multiple errors were identified but were not changed. There were no 
cases where a valid subdivision was modified. 

III. LIMITATIONS OF THE METHOD 



The success of the error-correcting algorithm depends on the 
comprehensiveness of the authority file. If the authority file is incomplete, 
the routine may change correct subject headings to other headings. Examples 
of such troublesome headings are: 

Adaption Adoption 
Painting Paintings 

The greater the number of subject headings in the authority file, the greater 
the probability of avoiding erroneous corrections. For instance, if both 
"adaption" and "adoption" are included in the authority file, an erroneous 
correction would not occur. 
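
The effect of an incomplete authority file can be seen in a small Python sketch. The function names and the plain uppercase keys are simplifications assumed for illustration; only the substitution case is modelled.

```python
def one_substitution_apart(a: str, b: str) -> bool:
    """True when a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def check_heading(test_key: str, authority_keys: list[str]) -> tuple[str, str]:
    """Return ('valid' | 'corrected' | 'questionable', resulting key)."""
    if test_key in authority_keys:
        return ("valid", test_key)  # an exact match always takes precedence
    for key in authority_keys:
        if one_substitution_apart(test_key, key):
            return ("corrected", key)
    return ("questionable", test_key)


# Incomplete authority file: the valid heading is wrongly "corrected".
incomplete = check_heading("ADAPTION", ["ADOPTION"])

# Complete authority file: the exact match prevents the erroneous change.
complete = check_heading("ADAPTION", ["ADAPTION", "ADOPTION"])
```

With only "ADOPTION" on file, the valid key "ADAPTION" is changed; once both keys are present, the exact match is found first and the heading is left alone.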

There are a number of instances where the subject headings contain 
multiple errors as well as those involving more than one character or a pair 
of characters. Examples of such errors are: 

Distribution (Probability theory) 
Education accountanlHty 

The algorithm described herein does not attempt to correct such errors. 

Despite these limitations, the proposed algorithm would eliminate numerous 
typographical errors that occur in the subject headings of the OCLC catalog 
records. Furthermore, this error-correcting algorithm will work with any 
field (e.g., author) for which an authority file can be created. 